Grid Processing Electronic Memory

ABSTRACT

The invention adds execution units to a conventional memory bank, and connects memory units in two dimensions in a grid. Highly enabled execution units occur in every row in the grid; individual grid units are also enabled to varying degrees with reduced execution capabilities. The multiple execution units follow a single instruction. Row-aligned or vector operations, and cross-row or vector-internal operations, can be performed simultaneously without crossing the front-side bus. Constant-time list copying and read-write array accessing, and linear- and sublinear-time sorting are possible as a result. Extended content-addressing is introduced. The running times of matrix multiplication and Gaussian elimination are improved by factors of the size of the matrix.

BACKGROUND OF THE INVENTION

Field of the Invention

The present application relates to computer memory devices and, in particular, to a novel design of computer memory device that performs computational operations in units located within the memory bank. As in a vector processor, multiple computational operations are performed simultaneously in replicated computation units.

Description of Related Art

To add the respective items in two lists using conventional designs, the numbers must be transferred from memory to the processor one at a time, the operations performed one at a time, and the results transferred back one at a time. Other operations, such as sorting, searching, summing, or copying a list, have the same limitation; and in the case of sorting, each transfer might be performed many times. Therefore, serial transferring on the front-side bus is a bottleneck. Serial processing is also a problem. Even with data that already exist in processor-local registers, the processor can perform only one operation at a time.

Many improvements on the basic conventional design exist today. One kind of “vector processor”, such as the MMX, SSE, or AltiVec processors, transfers data from memory to the processor and performs operations 4 at a time instead of one (see http://en.wikipedia.org/wiki/Vector_processor). A “physics processing unit” consists of an “array of custom floating-point units” (see http://en.wikipedia.org/wiki/Physics_processing_unit). Some supercomputers had “many limited functionality processors that would work in parallel”, doing “64,000 multiplications on 64,000 pairs of numbers at a time” (see http://en.wikipedia.org/wiki/SIMD). Content-addressable memory or associative memory “is designed such that the user supplies a data word and the CAM searches its entire memory to see if that data word is stored anywhere in it” (see http://en.wikipedia.org/wiki/Content-addressable_memory).

“Random access memory (RAM) is a form of computer data storage . . . that allow[s] stored data to be accessed in any order” (see http://en.wikipedia.org/wiki/Random-access_memory). “A front-side bus (FSB) is a computer communication interface . . . [which] typically carries data between the central processing unit (CPU) and a memory controller hub.” “Simple symmetric multiprocessors place a number of CPUs on an FSB, though performance does not scale linearly due to the architecture's bandwidth bottleneck” (see http://en.wikipedia.org/wiki/Front-side_bus).

BRIEF SUMMARY OF THE INVENTION

Many or most software programs perform a repeated sequence of steps on multiple data. In this case, the data has a parallel structure. With this invention, the respective steps of the sequence can be performed on all the data simultaneously. The invention has two major divisions: one, performing operations in a memory bank itself, instead of transferring data to the central processor, enabling many simultaneous operations; two, connecting the storage and computation units in two dimensions, enabling list-internal or cross-row operations. Operations performed on respective rows simultaneously but not involving other rows are vector operations. The applicable instructions are a subset of computer instructions, only applying to single-instruction multiple-data operations. This subset is free of the synchronization demands and overhead associated with general multiprocessing.

Conventional processing models interpret a memory module as a single long indexed list. In this invention the memory design interprets it as a grid: individual information storage units or “cells” are assigned indices in two dimensions. Rows and columns are formed consisting of all the cells that have been assigned the same index in either dimension. Pathways are made to connect the cells in the same rows and columns.

On one instruction, the memory bank performs one type of operation replicated in many replicated units in the bank simultaneously, like a vector processor. However, the elimination of the front-side bus allows a greater range of vector operations and an asymptotic speed improvement.

Hybridized forms of the following operations are described.

-   Unary, binary, and ternary arithmetic operations
-   Content-addressable and searching functions
-   Multiple extended forms of copying and read-write accessing

These will form the “ABC”s of hybrid processing. Combinations of them are expected to improve the speed of many common tasks. For example, sorting tasks, discussed later, can make use of them for faster performance. The responsibilities for using the invention that are delegated to the manufacturer of the hardware, the operating system, and the software authors are also briefly discussed.

On account of spatial limitations, individual cells are not enabled with the full range of computational capabilities otherwise available to a central processor; these are delegated to computation units located at the end of every row. However, cells are able to perform computations in a limited selection: comparison operations in cells, especially for equality, are expected to be the most widely used.

The description herein makes use of the notion of a list in memory. Items in lists can exist in cells that are consecutive in either dimension. For the discussion, lists will be depicted as residing in a single row or column, and the items will be numbers. In reality, lists will sometimes span multiple columns, causing a limit in the design's speed improvements; however the improvements will still be substantial: a constant factor of the number of rows in a column in the bank.

Asymptotic speed is described by reference to a common notation in the computer field, “Big Oh”. A process is described as running in O(F(N)) time, for some function F(N), for a list of N items, meaning that the process takes an amount of time proportional to F(N), even given the worst possible list of N, that input which causes the process to take the most time possible, when N can also be arbitrarily large.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows storage units connected in a grid with computation units occurring once per row.

FIG. 2 shows an example configuration of storage units with varying computational capabilities.

FIGS. 3A and 3B show the flow of operands and outputs in an arbitrary concurrent unary operation.

FIGS. 4A, 4B, and 4C show the flow of operands and outputs in an arbitrary concurrent binary operation with three configurations of arguments.

FIGS. 5A, 5B, 5C, 5D, 6A, and 6B show the flow of operands and outputs in an arbitrary concurrent ternary operation with six configurations of arguments.

FIG. 7 shows concurrently filling a list with its row indices.

FIGS. 8A and 8B show concurrent list reads and writes.

FIGS. 9A, 9B, and 9C show three variations on content-addressing or concurrent search operations: counting, match indexing, and match collation.

FIGS. 10A and 10B show direct and deferred forms of folding or reducing operations.

FIG. 11 shows a concurrent list transpose operation.

FIGS. 12A, 12B, 12C, and 12D show a concurrent transpose, a concurrent reverse transpose, two consecutive concurrent transposes, and an alternate implementation.

FIG. 13 shows a non-consecutive concurrent transpose write.

FIG. 14 shows a non-consecutive concurrent transpose read.

FIG. 15 shows a concurrent word-shift or byte-shift.

FIG. 16 shows a concurrent bubble sort.

FIG. 17 shows a concurrent insertion sort.

FIGS. 18 and 19 show the merge step in a concurrent merge sort.

FIG. 20 shows a Cartesian sort.

DETAILED DESCRIPTION OF THE INVENTION

In this application, the subset of possible operations in the examples consists of those operations in which data can travel along independent pathways, and be handled independently.

In an example of the “time-space trade-off” often encountered in software, further capabilities of the cells will result in faster speeds, but lower storage densities on a given chip. Losses of density of up to 90% or more in favor of computational capabilities could be possible, only offering 10 gigabytes of storage on a chip that would otherwise hold 100 gigabytes; as such the particular end user's purposes are relevant.

The process shown in FIG. 11 is expected to be the most widely demanded new operation, due to its ability to copy lists to arbitrary locations in constant time. Its spatial requirements are low, and it can be implemented in two ways, as shown in FIG. 12D. One implementation adds to the cells the circuitry necessary to determine whether two of a cell's respective inputs are equal. The other implementation instead adds a diagonal signal wire to activate the transposing behavior in the cells.

Additional capabilities in the per-row computation units do not have the same demands on space as those offered per-cell. Multiple storage units can make use of the capabilities while not costing additional space themselves. Therefore, the per-row units for vector operations are expected to be the next most widely demanded, including: unary, binary, ternary, searching, and folding operations, as well as the two slower sorting algorithms, shown in FIGS. 16 and 17, at only the cost of a wider first column.

Further enabled cells will provide the remainder of the operations at the cost of density as the manufacturer desires. The former implementation of the process in FIG. 11, per-cell equality comparison, also makes available non-consecutive read and write, and non-consecutive forward and reverse transpose. Multiple concurrent searching operations, such as simultaneously searching several lists occurring in overlapping rows, or multi-column searching, would also be available when searching on equality criteria; the respective circuits are required in cells for multi-column versions of any operation with a row list as an argument. In the case of greater-than (GT) comparison in cells, in addition to multi-column vector comparisons, the two faster sorting algorithms, shown in FIGS. 18, 19, and 20, also become available.

Two useful arithmetical operations not depicted in the diagrams that are expected to benefit from the invention are matrix multiplication and Gaussian and Gauss-Jordan elimination. The running time of these operations can be improved, by factors depending on the operations that are available in the cells. The argument matrices are not required to be square. For matrices M×P and P×N, conventional multiplication takes O(MNP) time. With the following capabilities available respectively, running time can be brought to:

-   Row-only operations: O(MN) vector multiplications, O(MN) folding sums
-   Cell addition: O(MN) vector multiplications, O(1) folding sum
-   Cell multiplication and addition: O(M+N) list copies, O(1) vector multiplications, O(1) folding sum
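
The row-only case can be illustrated in ordinary software. The following is a minimal Python sketch, not the hardware design itself: the helper names `vector_multiply`, `fold_sum`, and `grid_matmul` are hypothetical, and each helper call stands in for what would be a single memory-bank instruction. Python executes the helpers serially, so only the count of vector operations, one multiplication and one folding sum per output entry, reflects the O(MN) figures above.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]


def vector_multiply(a: List[float], b: List[float]) -> List[float]:
    """Stand-in for one concurrent column-column multiplication."""
    return [x * y for x, y in zip(a, b)]


def fold_sum(values: List[float]) -> float:
    """Stand-in for one folding sum over a list."""
    total = 0
    for value in values:
        total += value
    return total


def grid_matmul(a: Matrix, b: Matrix) -> Tuple[Matrix, Dict[str, int]]:
    """Multiply an M x P matrix by a P x N matrix, counting vector operations."""
    m, p, n = len(a), len(b), len(b[0])
    counts = {"vector_multiply": 0, "fold_sum": 0}
    result = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            column_j = [b[k][j] for k in range(p)]
            products = vector_multiply(a[i], column_j)  # one vector instruction
            counts["vector_multiply"] += 1
            result[i][j] = fold_sum(products)           # one folding instruction
            counts["fold_sum"] += 1
    return result, counts
```

For a 2×3 matrix times a 3×2 matrix the sketch issues four vector multiplications and four folding sums, that is, M·N of each.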

Gaussian elimination takes O(N^3) time conventionally for a matrix with N rows. It can be brought to:

-   Row-only operations: O(N^2) vector multiplications, O(N^2) vector additions
-   Cell multiplication and addition: O(N) vector multiplications, O(N) vector additions
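
As a rough illustration of the row-only figure, the sketch below performs forward elimination using one vector multiplication and one vector addition per eliminated row. It is written in Python with hypothetical helper names (`scale`, `add`, `eliminate`); it assumes every pivot is nonzero, so no row swaps are modeled, and it uses exact fractions only to keep the arithmetic easy to check.

```python
from fractions import Fraction


def scale(row, factor):
    """Stand-in for one vector multiplication (list times a single value)."""
    return [x * factor for x in row]


def add(a, b):
    """Stand-in for one vector addition of two aligned lists."""
    return [x + y for x, y in zip(a, b)]


def eliminate(matrix):
    """Reduce a matrix to upper-triangular form, counting vector operations.

    Assumption: every pivot encountered is nonzero (no pivoting is done).
    """
    rows = [[Fraction(x) for x in row] for row in matrix]
    counts = {"vector_multiply": 0, "vector_add": 0}
    n = len(rows)
    for pivot in range(n):
        for r in range(pivot + 1, n):
            factor = -rows[r][pivot] / rows[pivot][pivot]
            scaled = scale(rows[pivot], factor)
            counts["vector_multiply"] += 1
            rows[r] = add(rows[r], scaled)
            counts["vector_add"] += 1
    return rows, counts
```

For N rows the loop body runs N(N−1)/2 times, which is the O(N^2) count of vector multiplications and vector additions listed for row-only operations.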

An entire bank need not be enabled with an operation for it to be available: if cells enabled with either form of transpose occur somewhere in a row, then the entire row can use it, and copy a list to other cells which are enabled with the operation. For example, if a 16×4 block of cells is enabled with vector multiplication and folding addition, and every row is capable of transposing a list of size 4, then the two matrices can be copied to the enabled block and multiplied faster than they could be with row-only operations, while still consuming very little extra space.

In addition, trigonometric functions are single-argument (unary) functions, but are computed as composites. Numerical methods can be used to derive the functions' values, but the values can also be precomputed and stored in tables, simplifying the computations. To provide row-based access to the values, the tables could be replicated in the per-row computation units, or made available as dedicated lists, to be accessed and operated on with the operations in our design.

As is, a memory bank enabled with the invention could serve as a normal memory bank, albeit less dense and potentially slower depending on other factors. The operating system could make moderate use of a memory bank enabled with the invention in implementing standard C-Language library functions such as “memcpy”, “memmove”, “memset”, “memcmp”, “memchr”, and other C string and array functions, without needing changes to application software. A memory bank enabled with the invention on a video card, or other peripherals, could implement specific driver functions with the designs for the operations in the invention, transparently to application or operating system software. Low-level software such as the operating system and device firmware would need memory management routines such as “malloc” and “free” that can accommodate the two-dimensional layout, as well as possibly allow one or more extra parameters to specify which operations will be needed by the block the caller is requesting. It's possible that the operating system or CPU could perform some run-time code analysis to convert a serialized vector operation into the equivalent instruction for the memory bank transparently.

FIG. 1 depicts the grid layout of a memory bank. Storage units, “cells”, which can be comprised of one or several bits, populate the rows and columns in the grid. The storage in a cell will usually consist of a small number of information storage units acting together to store a single data item, such as a byte. Cells shown in the same row or column have the same row or column indices, respectively. Cells in the same row or column are connected by horizontal or vertical transfer pathways respectively. The primary computation units are replicated at the end of every row, shown on the left. These units, occurring on a per-row basis, have access to the cells in their respective rows. These units are also connected by a vertical transfer pathway, not shown.

FIG. 2 shows that the memory cells can be enabled with certain computational operations, usually minimal such as comparison and addition only, in addition to their functions as storage elements. Different combinations of capabilities in the cells will constitute different products of the manufacturer. Several options or combinations of variations on the design could be made available, specialized for different purposes. The units at the end of the rows will be enabled with many or most computational operations, while cells will be enabled with few or none. As shown in the diagram, some cells will have no computational abilities; some will be able to perform for example comparison for equality between values originating in their own storage, or arriving along transfer pathways. Some cells will be able to perform addition, while others can perform multiplication, yielding faster speeds than per-row operations, though occupying correspondingly more space. It's also possible that the per-row computation units could be enabled with operations to different degrees as well.

Next we discuss vector operations, simultaneous and independent computations on the respective items in a list or lists. The operations have many variations, due to the possible combinations of arrangements of arguments. In general, an argument to an operation can be a column list, a row list, or a single value, sometimes loosely called a “register” value. In addition, if cells possess the corresponding abilities, then multiple operations can be simultaneously performed in multiple locations, even in overlapping rows, sometimes called “multi-column” or “cell-wise” operations.

FIGS. 3A and 3B show a unary operation: negation of the numbers in an entire list. The list exists in a column. The computation units at the end of the rows receive the instruction. They read the numbers in their respective rows in the list from the column in which the list resides, via replicated horizontal pathways. The computation units compute the negatives of their respective contents, and emit them along second parallel pathways. The respective cells in another column commit to storage the values present on these pathways. The result is a new vertical list, containing the respective negatives of the earlier one. This process runs in O(1) time, faster than the conventional approach by a factor of the size of the list. FIG. 3B shows a simplified view of the operation with certain details omitted. If the identity function is applied, the effect is an aligned copy. If a single value is the argument, then list filling is performed, such as to create a new list filled with a single value.

FIGS. 4A, 4B, and 4C show three variations on a binary operation, using addition as an example. The first is simultaneous addition of the respective items in two lists. The lists exist in two columns, beginning at the same row, and ending at the same row. Again, the computation units at the end of the rows receive the instruction. They read the numbers in their respective rows in the lists from the columns in which the lists reside via two parallel horizontal pathways. They add the numbers and emit the sums along a third parallel pathway. The respective cells in a new column commit to storage the values present on it. The end result is a new list, containing the sums of the respective numbers in the originals, taking only O(1) time.

Binary operations have further variations. The next variation, FIG. 4B, replaces one of the lists with a single value. The computation units read the respective items in the list, being the only list argument, from the column the list resides in. They also read the single or “register” value from a location in the memory bank. They perform the respective additions and emit the sums. The cells in a second column receive these values and commit them respectively. Therefore, FIGS. 4A and 4B are the column-column and column-register variations respectively. These operations only require computation units to exist once per row, and cells need no computational abilities.

In the variation in FIG. 4C, multiple column-register additions are performed, the single value in each being the corresponding item from a separate row list. Each cell computes the sum of the item in its row of the column list residing in its column and the item in its column of the row list, and commits it. In another variation, not shown, a single value is the second operand for all the operations, such as to increment the items in a list, a column-register form. In yet another, taking one column list and one row list, the Cartesian product of the lists is evaluated, that is, every pair of numbers from both lists. The sum of each pair is computed, every cell computing the sum of the item in its row in the first list and the item in its column in the second list, a column-row form, and committing its result.
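
The argument arrangements described above can be modeled in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, list comprehensions stand in for the concurrent per-row or per-cell hardware, and `column_row` models the Cartesian (column-row) form in which every cell combines the item of its row with the item of its column.

```python
import operator


def column_column(op, first, second):
    """Two aligned column lists: one result per row."""
    return [op(x, y) for x, y in zip(first, second)]


def column_register(op, column, register):
    """One column list and a single value shared by every row."""
    return [op(x, register) for x in column]


def column_row(op, column, row):
    """Cartesian form: the cell at (r, c) combines column[r] with row[c]."""
    return [[op(c_item, r_item) for r_item in row] for c_item in column]


sums = column_column(operator.add, [1, 2, 3], [10, 20, 30])
incremented = column_register(operator.add, [1, 2, 3], 5)
pair_sums = column_row(operator.add, [1, 2], [10, 20, 30])
```

In the first two forms only the per-row units compute, whereas the column-row form presumes cells that can perform the operation themselves.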

FIGS. 5A, 5B, 5C, 5D, 6A, and 6B show the ternary operation, which is unique to computing. In it, an expression evaluates to one of two values depending on the comparison of a third value to zero, expressed as “a if b else c”. As in the addition example, one input is a column list, while the second and third arguments can originate in any combination of registers or other columns.

For ternary operations, similarly to binary operations, column-column-row, column-row-column, and column-register-row operations can be defined. In all, 9 combinations of inputs are possible for binary operations, some being redundant: the square of the set of column lists, row lists, and register operands, all having multi-column versions. For ternary operations, 27 input combinations are possible: the cube of that set.
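
A small Python sketch can make both the selection rule and the combination counts concrete. The names `broadcast` and `ternary_select` are hypothetical, and the sketch assumes, for illustration, that the condition `b` is the column list while `a` and `c` may each be either a list or a single register value.

```python
from itertools import product


def broadcast(argument, length):
    """Treat a single register value as a list repeating that value."""
    if isinstance(argument, (list, tuple)):
        return list(argument)
    return [argument] * length


def ternary_select(a, b, c):
    """Row-wise "a if b else c", with b compared against zero."""
    n = len(b)
    a_items, c_items = broadcast(a, n), broadcast(c, n)
    return [x if cond != 0 else z for x, cond, z in zip(a_items, b, c_items)]


# Enumerate the argument arrangements counted in the text.
ARGUMENT_KINDS = ("column", "row", "register")
binary_forms = list(product(ARGUMENT_KINDS, repeat=2))
ternary_forms = list(product(ARGUMENT_KINDS, repeat=3))

selected = ternary_select([1, 2, 3], [1, 0, -1], 0)
```

Here `len(binary_forms)` is 9 and `len(ternary_forms)` is 27, matching the square and cube of the three argument kinds.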

FIG. 7 shows that a 0-ary operation, one taking no arguments, is also feasible as a parallel operation. The computation units at the end of every row emit their row indices onto horizontal pathways. The cells that will comprise the new column list commit these contents, creating a sequential list of values. Using binary subtraction in the column-register form as described for binary operations above, the new list can be rebased to zero wherever needed. This address copy populates a column list with the row indices of the span of the list. In another implementation, the information of cells' row indices is persistently stored in the cells and consequently available to them.
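
The address copy and the subsequent rebasing step amount to the following, sketched in Python with hypothetical function names. Each list comprehension stands in for a single concurrent instruction.

```python
def address_copy(first_row, length):
    """Each per-row unit emits its own row index; a column commits them."""
    return [first_row + offset for offset in range(length)]


def rebase(column, base):
    """Column-register subtraction: subtract one shared value in every row."""
    return [item - base for item in column]


row_indices = address_copy(5, 4)      # a list spanning rows 5 through 8
zero_based = rebase(row_indices, 5)   # consecutive indices starting at zero
```

The zero-based result is the kind of consecutive index list that the transpose operations take as an argument.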

FIGS. 8A and 8B show simultaneous array indexing operations performed on multiple row lists. The read operation takes one input, a list of column indices to read by individual rows. The indices are emitted along a horizontal pathway. The cells whose column indices equal these values in the respective rows emit their values along a second horizontal pathway, and the cells in a new column commit these contents. The write operation takes two inputs, a list of column indices and a list of values. These are emitted onto two respective horizontal pathways in every row. This time, the cells whose column indices equal the index commit the value. These operations can be interpreted as a generalization of a 3-column list ternary operation. The row lists being read from or written to can be considered inputs to the operation, but it is equally consistent to view the operation as selecting a value for each row from multiple column lists.
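
The read and write forms can be modeled as below. This Python sketch uses hypothetical names and a list of lists as the grid; the inner loop with its equality test stands in for every cell of a row comparing its own column index against the index on the pathway, which the hardware would do concurrently.

```python
def indexed_read(grid, indices):
    """For each row, emit the value of the cell whose column index matches."""
    result = []
    for row, wanted in zip(grid, indices):
        emitted = None
        for column_index, value in enumerate(row):
            if column_index == wanted:      # per-cell equality comparison
                emitted = value
        result.append(emitted)
    return result


def indexed_write(grid, indices, values):
    """For each row, the cell whose column index matches commits the value."""
    for row, wanted, value in zip(grid, indices, values):
        for column_index in range(len(row)):
            if column_index == wanted:      # per-cell equality comparison
                row[column_index] = value


grid = [[10, 11, 12], [20, 21, 22], [30, 31, 32]]
read_back = indexed_read(grid, [2, 0, 1])
indexed_write(grid, [0, 1, 2], [-1, -2, -3])
```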

Next we discuss list-internal operations, which involve transferring data on the vertical axis and thereby allow rows to share data.

FIGS. 9A, 9B, and 9C describe extensions to content-addressable memory, which exists in the prior art. To our knowledge, content addressing is conventionally performed only on equality. We introduce the other comparisons and new ways of tabulating multiple matches, including accessing matches by index and collation. In the operation, an entire list is tested simultaneously for success or failure in a comparison by the per-row computation units. The comparison can include both normal equality comparison, and inequality: less than (LT), less than or equal (LE/LTE), unequal (NE), greater than or equal (GE/GTE), and greater than (GT). The value to be sought or compared to exists in a register or single cell in the bank; the computation units access it simultaneously. The items in the list to be searched are emitted along horizontal pathways, and the respective per-row computation units perform the comparison.

The results of the comparison can be made available in many forms, not shown except where noted:

1. The row indices of the items on which the test succeeds are emitted along a horizontal pathway and form a new list, shown in FIG. 9A.
2. The items themselves which succeed form the new list. In (1) and (2), a special value is emitted in the rows in which the test fails, or no value is committed in the new list.
3. A one or a zero is emitted for each row to form a new list to indicate the success or failure of the test.
4. The new list in (3) is summed without committing, returning the total number of matches, a single value to be stored in a register or single cell, as in FIG. 9A.
5. The index or value of the Nth successful match in (1) or (2) is found, a single value, for N also a single value specified in a register or single cell, as in FIG. 9B, called access by “match index”.
6. The results of (1) or (2) are selectively shifted towards one end within the per-row units, skipping the indices or values on which the test failed, to collate the list, as in FIG. 9C.

As with the binary operations above, multiple searches can be performed concurrently if cells are enabled with the corresponding comparison and result-handling operations; for a single column list search, the per-row units are sufficient.

Collation can be performed in either of the following ways.

6-1. A row list is populated with the indices of the rows in which the test succeeded as in FIG. 7, the process in FIG. 14 is performed on the result list to the indices to form a new row list, then the row list is copied back to a column by the process in FIG. 11.
6-2. The contents of the result list are simultaneously emitted towards the top, each unit committing the Nth value that passes it, for N equal to the number of rows above it in which the test failed.
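
The result forms can be summarized in a short Python model. All function names here are hypothetical; `None` is used as the special value for failed rows, the match index N is taken to be zero-based, and `collate` simply compacts the matches rather than modeling either hardware collation procedure. These are assumptions made for the sketch, not details given in the text.

```python
import operator

COMPARISONS = {
    "EQ": operator.eq, "NE": operator.ne,
    "LT": operator.lt, "LE": operator.le,
    "GT": operator.gt, "GE": operator.ge,
}


def match_flags(column, comparison, sought):
    """A one or a zero per row, indicating success or failure of the test."""
    test = COMPARISONS[comparison]
    return [1 if test(item, sought) else 0 for item in column]


def match_indices(column, comparison, sought, miss=None):
    """Row index where the test succeeds, a special value where it fails."""
    flags = match_flags(column, comparison, sought)
    return [row if flag else miss for row, flag in enumerate(flags)]


def match_count(column, comparison, sought):
    """Sum of the flag list: the total number of matches."""
    return sum(match_flags(column, comparison, sought))


def nth_match_index(column, comparison, sought, n):
    """Row index of the n-th successful match (zero-based by assumption)."""
    flags = match_flags(column, comparison, sought)
    hits = [row for row, flag in enumerate(flags) if flag]
    return hits[n] if n < len(hits) else None


def collate(column, comparison, sought):
    """Matching items shifted together, skipping the rows that failed."""
    flags = match_flags(column, comparison, sought)
    return [item for item, flag in zip(column, flags) if flag]
```

For the list 5, 9, 2, 9, 7 searched with GT against 5, the flags are 0, 1, 0, 1, 1, the count is 3, and collation yields 9, 9, 7.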

Combined with the binary operations, the searching operations offer progress on the problem of generating unique keys for sparse-key containers, which otherwise limits their speed. Using the invention, keys are maintained in sorted order, as insertion and removal can be performed in constant time with the process in FIG. 15. To find a unique key, a key is found whose numeric successor is not currently assigned. To do this, the list of keys is temporarily copied into a second column and shifted up one cell with the process in FIG. 15. The differences between the two lists, the original and the shifted copy, are found with binary subtraction, and a pair with a difference greater than one is found using searching. Then any value between the two items is unused and can be assigned. This takes constant time. Alternatively, the difference values can be maintained in a structured pair along with the keys.
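
The unique-key procedure reads naturally as three list-wide steps, sketched here in Python. The function name is hypothetical, and the two fallback cases (an empty key list, and a key list with no gap) are not addressed in the text; the choices made for them below are assumptions of the sketch.

```python
def find_unused_key(sorted_keys):
    """Return a key that is not in sorted_keys, using list-wide steps."""
    if not sorted_keys:
        return 0                          # assumption: any key is free
    # Step 1: copy the list and shift the copy by one cell.
    shifted = sorted_keys[1:]
    # Step 2: binary subtraction of the original from the shifted copy.
    differences = [b - a for a, b in zip(sorted_keys, shifted)]
    # Step 3: search the differences for a value greater than one.
    for row, difference in enumerate(differences):
        if difference > 1:
            return sorted_keys[row] + 1   # lies strictly between the pair
    return sorted_keys[-1] + 1            # assumption: no gap, extend the range


free_key = find_unused_key([1, 2, 3, 7, 8])
```

In software each step is a loop; on the described memory bank each would be one shift, one vector subtraction, and one search.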

FIGS. 10A and 10B introduce “folding” or “reducing” a single list, such as finding its internal sum, product, or minimum. The overall operation has only one input, a list; however, it is composed of multiple two-argument operations. In our design, folding has two forms, “direct” and “deferred”. Direct folding is a serial operation, but it proceeds independently of the system clock. Deferred folding constructs, informally, a tournament bracket from the list, performing many individual operations simultaneously and aggregating the results. The aggregated values then participate in further rounds of simultaneous individual operations. Both forms result in a single value. Only the direct form is available for non-associative operations. The running time of both forms of the overall operation is open to interpretation. The direct form can be interpreted to take either O(N) or O(1) time, as the operation performed on shorter lists can terminate earlier, but still requires the length of time it takes for a signal to travel from one end of a column to the other. This concern is usually outweighed by the overhead of the front-side bus transfer when the operation is performed in serial. The deferred form can be interpreted to take either O(Log N) or O(N Log N) time, as it takes multiple cycles, but the distance the signals travel increases on each step.

In the direct form, as shown in FIG. 10A, the per-row computation units read the list from its original location. The computation unit at the bottom of the range of rows that the list spans emits the item it has read upwards along a vertical pathway that connects the per-row computation units. The next-to-bottom computation unit emits the result of performing the individual operation on the item it has read from its own row and the value arriving from the computation unit below it. Each unit above does likewise. When the top computation unit in the range the list spans completes the operation it is to perform, the result is stored in a cell in a new location. Therefore, the operation takes the amount of time it takes for a signal to travel from the bottom of the list to the top, plus any delays from the actual computations. The direction of the sequence varies for non-associative operations, including subtraction, division, and exponentiation.
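
A direct fold can be sketched in Python as a chain of two-argument operations. The function name and the `upward` flag are hypothetical. The sketch assumes that each unit combines the item of its own row with the value arriving from its neighbor, and the operand order chosen in each direction is likewise an assumption.

```python
import operator


def direct_fold(op, column, upward=True):
    """Chain a two-argument operation along a non-empty column list.

    upward=True starts at the bottom item and carries the running value
    towards the top; upward=False starts at the top and carries it down.
    """
    if upward:
        carried = column[-1]
        for item in reversed(column[:-1]):
            carried = op(item, carried)   # own item, then value from below
    else:
        carried = column[0]
        for item in column[1:]:
            carried = op(carried, item)   # value from above, then own item
    return carried


total = direct_fold(operator.add, [1, 2, 3, 4])
bottom_up = direct_fold(operator.sub, [10, 3, 2])                # 10 - (3 - 2)
top_down = direct_fold(operator.sub, [10, 3, 2], upward=False)   # (10 - 3) - 2
```

For addition the direction is immaterial, while for subtraction the two directions give different results, which is why the direction of the sequence matters for non-associative operations.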

In the deferred form, as shown in FIG. 10B, available only to associative operations, such as sum and minimum, but not subtraction, the overall operation takes many steps. In the first step, the individual operation is performed on items in the even-numbered rows and the items in the respective next-lower odd-numbered rows simultaneously, and the results are temporarily stored in the even-numbered rows of the pairs. In the next step, the individual operation is performed on items the first step produced in row numbers 0, 4, 8, 12, and so on, and those in rows 2, 6, 10, 14, and so on, simultaneously, and temporarily stored in the locations of the former. Then the individual operation is performed on items the previous step produced in rows 0, 8, 16, 24, and so on, and those in rows 4, 12, 20, 28, and so on, simultaneously. Like a tournament bracket, the item in the last row in the selection receives a bye for rounds in which the next-lower row that would be in the selection isn't contained in the bounds of the list. When only one individual operation is performed in a step, the final result is produced and the larger sequence is complete. For a list of size N, in the first step, N/2 operations are performed, N/4 in the second step, N/8 in the third, and so on, for a total of N−1 operations, the same as the direct form, in Log N steps. Row numbering can be considered to be offsets from the top row of the list instead of the actual row indices. The general numbering of the rows to participate in a step is A) 2^S*K and B) 2^S*K+2^(S−1), for step numbers S starting at 1 and integers K starting at 0 that produce row numbers in the range of the list.
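
The pairing schedule can be checked with a short Python simulation. The function name is hypothetical and the list is assumed to be non-empty; in step S the row at offset 2^S*K is combined with the row at offset 2^S*K+2^(S−1), and a row whose partner falls outside the list receives a bye. The inner loop would be a single concurrent step in hardware.

```python
import operator


def deferred_fold(op, column):
    """Tournament-style fold of a non-empty list with an associative op.

    Returns (result, number_of_steps, number_of_individual_operations).
    """
    values = list(column)
    n = len(values)
    step = 1
    operations = 0
    while 2 ** (step - 1) < n:
        stride = 2 ** step            # spacing of the A rows: 2^S * K
        half = 2 ** (step - 1)        # offset to the B row: 2^(S-1)
        for a_row in range(0, n, stride):
            b_row = a_row + half
            if b_row < n:             # otherwise the A row receives a bye
                values[a_row] = op(values[a_row], values[b_row])
                operations += 1
        step += 1
    return values[0], step - 1, operations


result, steps, operations = deferred_fold(operator.add, [1, 2, 3, 4, 5])
```

For the five-item list the simulation takes three steps and four individual operations, N−1 in total, with row 4 receiving a bye in the first two steps.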

Extended copying involves transferring data on both axes. Copying with a conventional front-side bus is painstakingly slow. Some faster mechanisms are described.

In an aligned copy, as stated, the cells containing a list emit their respective items onto horizontal pathways, and cells in a new column commit them. But an aligned copy is useful only in limited cases, as programs rarely wish to copy lists only to locations aligned with the original. In the operation shown in FIG. 11, a column list is copied to a new row list, creating the possibility of reversing the step at a different row. Initially, two column lists are emitted on horizontal pathways, one containing the items to transpose, the other an index list of consecutive integers, such as produced by the process in FIG. 7. The cells in which the cell's column index equals the item in its respective row in the index list simultaneously copy the respective items in their rows in the value list onto vertical pathways. The cells in which the indices are not equal do nothing. The cells in a new row simultaneously commit the respective values travelling vertically to storage, and a new row list is created. Cells' row and column indices are also available to them as discussed for FIG. 7.

As shown in FIGS. 12A, 12B, 12C, and 12D, a row-to-column transpose is similarly possible. Two row lists are emitted on vertical pathways, one containing values and the other containing indices. The cells in which the cell's own row index equals the item in its respective column in the index list simultaneously copy the respective items in their columns in the value list onto horizontal pathways, and the cells in a new column commit them. By performing both forms consecutively, the column-to-row form followed by the row-to-column form, a list can be copied to any other location, free of alignment restrictions, in O(1) time, shown in FIG. 12C.
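
The two consecutive transposes can be simulated on a small grid. In this Python sketch the function name, the `(row, column)` tuples, and the `scratch_row` parameter are hypothetical, and the intermediate row list is assumed to occupy columns 0 onward of a row that does not overlap the source or destination cells. The equality tests stand in for the per-cell index comparisons; the loops run serially here but model single concurrent steps.

```python
def copy_column_list(grid, src, dst, length, scratch_row):
    """Copy a column list to an unaligned column via an intermediate row.

    src and dst are (first_row, column) pairs.
    """
    src_row, src_col = src
    dst_row, dst_col = dst
    height, width = len(grid), len(grid[0])

    # Step 1, column-to-row: a zero-based consecutive index list selects,
    # in each source row, the cell that turns the value onto a vertical path.
    index_list = list(range(length))
    for offset, wanted_column in enumerate(index_list):
        for column in range(width):
            if column == wanted_column:          # per-cell equality test
                grid[scratch_row][column] = grid[src_row + offset][src_col]

    # Step 2, row-to-column: an index list of destination rows selects,
    # in each column, the cell that turns the value onto a horizontal path.
    index_row = [dst_row + offset for offset in range(length)]
    for column, wanted_row in enumerate(index_row):
        for row in range(height):
            if row == wanted_row:                # per-cell equality test
                grid[row][dst_col] = grid[scratch_row][column]


grid = [[0] * 6 for _ in range(6)]
for i, value in enumerate([1, 2, 3]):
    grid[i][1] = value                           # source list: rows 0-2, column 1
copy_column_list(grid, src=(0, 1), dst=(2, 4), length=3, scratch_row=5)
```

After the call, column 4 holds the list in rows 2 through 4, a location not aligned with the source rows.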

For an ordinary transpose, the index list is an increasing consecutive list of indices, which can be generated in two steps with an address copy and binary subtraction. For the reversing form, shown in FIG. 12B, the index list is consecutive decreasing, similarly generated. The design for the operation also has a variant in which diagonal instruction wires trigger the corner-turning behavior in the cells, shown in FIG. 12D, the opposite diagonal being necessary for the reversing form. The instruction causes one of the wires to carry the activating signal, and the operation takes only the list of values as an argument.

FIG. 13 describes a non-consecutive transpose. As in the process in FIG. 11, two column lists are emitted on horizontal pathways, one containing the items to transpose, but the other an index list of arbitrary integers. The cells in which the cell's own column index equals the item in its respective row in the index list, simultaneously copy the respective items in their rows in the value list onto vertical pathways. The cells in the new row simultaneously commit the respective values travelling vertically to storage if any is present. No commit is performed in the columns in which no value has been emitted onto the vertical pathway by a cell.
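In software terms the non-consecutive transpose behaves like a scatter. The sketch below is illustrative only: the name `non_consecutive_transpose` and the `dest_row` parameter are assumptions, and the handling of two rows that name the same column (the later row wins) is a choice made for the sketch, since that case is not addressed above.

```python
from typing import List


def non_consecutive_transpose(values: List[int], indices: List[int],
                              dest_row: List[int]) -> List[int]:
    """Scatter `values` into a copy of `dest_row` at the columns named by `indices`."""
    result = list(dest_row)                  # prior contents of the destination row
    for value, index in zip(values, indices):
        if 0 <= index < len(result):         # a cell exists in that column
            result[index] = value            # value travels vertically and is committed
    # Columns that no cell emitted onto keep their prior contents (no commit).
    return result
```

Scattering the values 10, 20, 30 with indices 4, 0, 2 into a row of six zeros leaves columns 1, 3, and 5 untouched.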

FIG. 14 shows a process in the reverse form of the process in FIG. 13. A new column list is created from multiple items in a row list, useful for simultaneous reads from a single list. The arguments are a column list of the indices to be read from the row list, and the row list itself. The cells in the column list simultaneously emit the indices onto corresponding horizontal pathways, and the cells in the row list emit their items onto vertical pathways. The cells in which the index arriving on the respective horizontal pathway equals the cell's respective column index copy the contents of the vertical pathways onto respective second horizontal pathways. The cells in the column designated to contain the result column list commit the values arriving on the second horizontal pathways.
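The reverse form behaves like a gather, in which several rows read from one row list at once. The following is a minimal sketch, assuming a hypothetical function name `concurrent_read` and using `None` for a row whose index matches no column; the loops run serially where the cells would compare simultaneously.

```python
from typing import List, Optional


def concurrent_read(row_list: List[int], index_column: List[int]) -> List[Optional[int]]:
    """For each index in `index_column`, read the matching item of `row_list`."""
    result: List[Optional[int]] = []
    for index in index_column:                   # index travels on a horizontal pathway
        arrived: Optional[int] = None
        for col, item in enumerate(row_list):    # item travels on a vertical pathway
            if col == index:                     # the matching cell forwards the item
                arrived = item                   # onto the second horizontal pathway
        result.append(arrived)                   # result column commits what arrived
    return result
```

Because each row has its own horizontal pathways, two rows may read the same column without conflict, as the repeated index 3 in `concurrent_read([10, 20, 30, 40], [3, 3, 0])` shows.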

FIG. 15 describes a byte roll, similar to a bit roll in which a number is shifted by a number of bits, specified in a second argument, and the newly open bits are cleared or replaced by the ones that were shifted off the end. In it, a list is shifted up or down by a number of cells, and the newly open cells are cleared or replaced by the ones that were shifted off the end. The operation is especially applicable to lists that span multiple columns. Rolling and shifting would be useful for a stack in which the top stays at a constant address, and to maintain a list in sorted order.
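The two behaviors, clearing the opened cells or refilling them with the cells shifted off the end, can be sketched as follows. The names `word_shift` and `word_roll`, the `fill` parameter, and the convention that a positive count moves items toward higher row indices are assumptions for illustration.

```python
from typing import List


def word_shift(cells: List[int], count: int, fill: int = 0) -> List[int]:
    """Shift a list down (count > 0) or up (count < 0), filling opened cells."""
    cells = list(cells)
    n = len(cells)
    if count >= 0:
        k = min(count, n)
        return [fill] * k + cells[:n - k]    # items move down; the top cells open
    k = min(-count, n)
    return cells[k:] + [fill] * k            # items move up; the bottom cells open


def word_roll(cells: List[int], count: int) -> List[int]:
    """Roll a list down (count > 0) or up (count < 0); nothing is lost."""
    cells = list(cells)
    if not cells:
        return cells
    k = count % len(cells)                   # a full turn leaves the list unchanged
    return cells[-k:] + cells[:-k] if k else cells
```

Shifting the list 1, 2, 3, 4, 5 down by two gives 0, 0, 1, 2, 3, while rolling it down by two gives 4, 5, 1, 2, 3.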

Improvements to sorting are also described herein. Conventional comparison-based sorting has been proven to run in O(N Log N) time at best. Our design performs multiple comparisons simultaneously, while each comparison still takes only two items. Four sorting algorithms making use of our invention are described, running in O(N), O(N), O(Log N), and O(1) time respectively. All four are stable.

In FIG. 16 adjacent pairs of numbers in a list are compared simultaneously, with alternating pairs selected in successive steps, swapping pairs that are out of order for the next step. N steps are required in the worst case, such as when the largest element starts at the top, and moves down one row in each step. Pairs of items that are already ordered aren't swapped. The comparisons occur alternately in even- and odd-numbered rows taken as relative indices in the list. The top and bottom rows receive byes if they're unpaired in that step.
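The alternating-pair scheme corresponds to what is commonly called odd-even transposition sort. The following is a minimal sketch of it, with the hypothetical name `concurrent_bubble_sort`; the inner loop visits the disjoint pairs of a step one at a time, whereas the grid would compare and swap all of them at once.

```python
from typing import List


def concurrent_bubble_sort(items: List[int]) -> List[int]:
    """Sort by N steps of compare-and-swap on alternating adjacent pairs."""
    rows = list(items)
    n = len(rows)
    for step in range(n):                    # N steps cover the worst case
        start = step % 2                     # even pairs, then odd pairs, alternately
        for top in range(start, n - 1, 2):   # pairs are disjoint within a step
            if rows[top] > rows[top + 1]:    # strictly out of order: swap
                rows[top], rows[top + 1] = rows[top + 1], rows[top]
        # Rows left unpaired at the top or bottom receive a bye in this step.
    return rows
```

Equal items are never swapped, because the test is a strict greater-than, which is what keeps the sort stable.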

In FIG. 17, every item in a list is visited once and copied to its progressively sorted location in a new list. The insertion point for an item is found with a simultaneous list-wide inequality comparison in O(1) time, similar to a content-addressable operation on inequality. The list is shifted down one cell from that point down in O(1) time by the process in FIG. 15, and the item is inserted into the newly open cell.
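The three parts of each visit (list-wide comparison, shift, insert) can be sketched as below. This is an illustration under assumptions: the name `concurrent_insertion_sort` is hypothetical, and the comparison and the shift are written as serial loops although each is described above as a single O(1) operation in the grid.

```python
from typing import List


def concurrent_insertion_sort(items: List[int]) -> List[int]:
    """Insert each item into a growing sorted list at its ordered position."""
    out: List[int] = []
    for item in items:
        # List-wide inequality test: every cell reports "my value > item".
        flags = [existing > item for existing in out]
        # Insertion point: the first cell whose flag is set, else the end.
        point = flags.index(True) if True in flags else len(out)
        # Shift down one cell from the insertion point (the FIG. 15 process).
        out.append(item)                     # grow by one cell; overwritten below if shifted
        for r in range(len(out) - 1, point, -1):
            out[r] = out[r - 1]
        out[point] = item                    # insert into the newly open cell
    return out
```

Inserting after all items that are less than or equal to the new item, rather than before them, is what preserves the original order among equal items.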

As shown in FIGS. 18 and 19, in the concurrent merge sort, progressively larger sorted regions of a list are created with a simultaneous merge operation replicated on all the regions, sorting the list in O(Log N) repetitions of the operation in O(Log N) time, the merge step taking O(1) time. After the first round, sorted lists of size 2 will be created; after the second round, sorted lists of size 4 will be created; size 8 after round 3, etc. The merge step merges two initially sorted lists to create a new sorted list of their combined size in constant time, replicated simultaneously on every region that has been sorted up to that point. In the merge step, each item is moved to its new ordered location by the process in FIG. 13. Its new location is the sum of the item's index within the input list in which it occurs and the number of items in the other input list that are less than it; or for the second list, less than or equal.

Only the merge step is depicted in the diagrams: two sorted lists of size 8 are merged into a list of size 16. The bottom list is transposed to be perpendicular to the top list, FIG. 18 Step A, and the top list is compared to the bottom list for inequality, FIG. 18 Step B. The less than or equal to (LE) symbol indicates the cells in which the items in the cell's row in the column list and its column in the row list pass the test. The relative column indices of the leftmost cells in each row which pass the test are stored in a temporary column, or the length of the row list if none do, FIG. 18 Step C; in some interpretations this step takes O(N) time, affecting our interpretation of the running time overall. A counting sequence, or the relative row indices of the top list, is simultaneously added to the respective comparison results to obtain the respective new indices of the items in the top list, FIG. 18 Step D. The process is repeated for the bottom list, exchanging the respective lists, and substituting the LT comparison for LE to resolve collisions, FIG. 18 Steps E-H. The top list and new indices are transposed into horizontal lists, and the items in the top list are stored to their new positions in the new column list by the process in FIG. 13, shown in FIG. 18 Step I, and repeated for the bottom list, FIG. 18 Step J. The result list is fully populated, and the merge step is complete. LT, LE, and EQ capabilities are required in a square of cells the size of the input lists on a side, LT and LE for the comparisons and EQ for the non-consecutive transpose.
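The index arithmetic of the merge step, and the doubling rounds that use it, can be sketched in software. The names `merge_step` and `concurrent_merge_sort` are hypothetical, and the sketch runs serially what the grid would do simultaneously: every comparison in the square, and every region's merge within a round.

```python
from typing import List, Optional


def merge_step(top: List[int], bottom: List[int]) -> List[int]:
    """Merge two sorted lists by computing each item's new index directly."""
    result: List[Optional[int]] = [None] * (len(top) + len(bottom))
    for i, t in enumerate(top):
        # Leftmost column passing LE (t <= bottom item), or the list length.
        rank = next((c for c, b in enumerate(bottom) if t <= b), len(bottom))
        result[i + rank] = t                 # own index plus comparison result
    for j, b in enumerate(bottom):
        # LT in place of LE, so equal items from `top` land first.
        rank = next((c for c, t in enumerate(top) if b < t), len(top))
        result[j + rank] = b
    return [x for x in result if x is not None]


def concurrent_merge_sort(items: List[int]) -> List[int]:
    """Sort by rounds of merges on regions of size 1, 2, 4, and so on."""
    rows = list(items)
    width = 1
    while width < len(rows):
        merged: List[int] = []
        for start in range(0, len(rows), 2 * width):   # regions are independent
            merged.extend(merge_step(rows[start:start + width],
                                     rows[start + width:start + 2 * width]))
        rows = merged
        width *= 2
    return rows
```

Merging 1, 3, 5, 7 with 2, 3, 6, 8 places the 3 from the top list at index 2 and the 3 from the bottom list at index 3, so no two items collide on the same location.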

As shown in FIG. 20, in a Cartesian sort, a list is sorted in O(1) time. The method is named for the Cartesian product of the list to be sorted and itself, performing a comparison on every pair in it: each item is compared to the rest of the list. Then the items are compared to only items that occur earlier in the list than their original locations for equality, to preserve the order among equal items. An item's new ordered index is the total number of tests it passes.

A copy of the list is transposed into a row. A greater than (GT) test, shown in FIG. 20 Step A, is performed in the area of the square formed, comparing the column on the left and the row on the right. The GT symbol indicates a true outcome of the comparison, while a dot indicates a false one. The respective successful tests for each row are counted and stored in a temporary column list, FIG. 20 Step B; in some interpretations this step takes O(N) time, affecting our interpretation of the running time overall. An equality (EQ) test is performed next, but only performed on the lower-left triangle of the square excluding the diagonal, shown in FIG. 20 Step C. The respective successful tests for each row are then counted, also taking O(N) time in some interpretations, and stored in a temporary column list, FIG. 20 Step D. The two temporary lists are added with 2-column binary vector addition and stored in a third temporary column list, FIG. 20 Step E. Then the third list is transposed to align with the transposed original list, FIG. 20 Step F. Then the values in the original list are copied into their newly sorted locations using the third temporary list as indices for the process in FIG. 13, shown in FIG. 20 Step G. GT, EQ, and comparison counting capabilities are required in a square of cells the size of the list on a side for the operation as depicted. The serial equivalent of a Cartesian sort would take O(N^2) time.
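The counting scheme of Steps A through G can be written out as the serial equivalent mentioned above, which takes O(N^2) time. The following is a minimal sketch with the hypothetical name `cartesian_sort`; each list comprehension stands for a test that the square of cells would evaluate in one step.

```python
from typing import List, Optional


def cartesian_sort(items: List[int]) -> List[int]:
    """Sort by giving each item an index equal to the number of tests it passes."""
    n = len(items)
    # Steps A-B: GT test over the full square, counted per row.
    gt_counts = [sum(1 for c in range(n) if items[r] > items[c]) for r in range(n)]
    # Steps C-D: EQ test over the lower-left triangle (columns before the diagonal).
    eq_counts = [sum(1 for c in range(r) if items[r] == items[c]) for r in range(n)]
    # Step E: add the two temporary lists to obtain the new indices.
    new_index = [g + e for g, e in zip(gt_counts, eq_counts)]
    # Steps F-G: copy each value to its new location (the FIG. 13 process).
    result: List[Optional[int]] = [None] * n
    for r in range(n):
        result[new_index[r]] = items[r]
    return [x for x in result if x is not None]
```

For the list 3, 1, 2, 3, 1 the new indices are 3, 0, 2, 4, 1: the second 3 and the second 1 each pass one equality test against an earlier equal item, which places them directly after it.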

We expect the architecture's instruction set to grow significantly when dealing with multiple lists. Selection of the rows and columns to participate in a given operation is expected to consume many bits in an instruction. Every instruction has as many as 5 or more parameters: a top row index, a bottom row index, 0 to 3 input column indices, and an output index. With the cell-wise variations, column selection takes a left column index and an additional right column index. In the vector operations, selection of the column to contain the result can also be specified by a range, with a left and right column index; the result would be copied into several new lists, creating multiple copies, though aligned copying to a selection of columns could be performed in a separate step. Then combined with the register-argument unary operation, one value can be made to fill an entire rectangular region in O(1) time.

Even further parameters can be specified in the domain of single-instruction multiple-data, in particular stepping and substepping in both dimensions for particular operations. This can be accomplished with the ternary operation, performing an operation throughout a range of rows, then selecting between the results and the original contents based on stored or computed selection criteria, but it could also participate directly in the instruction set. Due to the vast improvement in speed, composing an instruction from multiple system words, as the larger instruction set would require, might be faster than conditionally selecting results. However, the conditional selection method allows more advanced criteria, such as early termination of numerical procedures once a tolerance is reached. For example, with the column-register-register form of the ternary operation, a new list can be populated with 1's or 0's to indicate the success or failure of a condition.
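Conditional selection with the ternary operation can be sketched as follows. All names here (`ternary`, `stepped_negate`, `flag_list`) are hypothetical, negation is used only as a sample operation, and the row-index test `r % step == 0` is one possible stored or computed criterion chosen for illustration.

```python
from typing import List


def ternary(criteria: List[bool], if_true: List[int], if_false: List[int]) -> List[int]:
    """Row-wise 'if a then b else c' over three aligned column lists."""
    return [t if c else f for c, t, f in zip(criteria, if_true, if_false)]


def stepped_negate(rows: List[int], step: int) -> List[int]:
    """Negate every `step`-th row by computing everywhere, then selecting."""
    computed = [-x for x in rows]                          # operation on the whole range
    criteria = [r % step == 0 for r in range(len(rows))]   # selection criteria per row
    return ternary(criteria, computed, rows)               # keep originals elsewhere


def flag_list(criteria: List[bool]) -> List[int]:
    """Column-register-register form: fill a new list with 1's or 0's."""
    n = len(criteria)
    return ternary(criteria, [1] * n, [0] * n)             # registers broadcast per row
```

Calling `stepped_negate([1, 2, 3, 4, 5], 2)` negates rows 0, 2, and 4 and leaves rows 1 and 3 with their original contents.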

It is contemplated that the central processor would continue to drive the operations in the bank. However, we are not far from eliminating it along with the bus. By placing an instruction counter in the memory bank, the execution units in the bank can be run asynchronously from the central clock for brief periods to increment through a sequence of instructions also present in the bank. By designating one row as primary and adding registers, there is little difference from a CPU. Instruction counters could even be placed in every row: in this case, rows could execute independent operations, so long as they don't need access to data in other rows.

We use addition above as an example of binary and folding operations, and negation as an example of a unary operation. These were examples only. The following table of operations expresses more thoroughly the selection of operations that can be enabled in any or all rows, columns, and cells. The table is arranged by number of arguments and operation type. The results are not counted as an argument. The operations can be performed in parallel as described, multi-column variations taking the corresponding extra arguments:

0 arguments: column list or row list:

-   Address copy

1 argument: column list or row list:

-   Direct and deferred folding, associative:
    -   Add, multiply
    -   Logical and, or, xor
    -   Bitwise and, or, xor
    -   Min, max
-   Direct only folding, non-associative:
    -   Subtract, divide, exponent, modulo
    -   Logical nand, nor, nxor, imp
    -   Bitwise nand, nor, nxor, imp
-   Concurrent bubble sort
-   Concurrent insertion sort
-   Concurrent merge sort
-   Cartesian sort

1 argument: column list, row list, or register:

-   Identity (register fill or list copy)
-   Additive inverse (negative)
-   2's complement
-   Logical inverse (Boolean not)
-   Bitwise inverse (1's complement/bitwise not)
-   Trig:
    -   sin, cos, tan, sec, csc, cot (4 more)
    -   sinh, cosh, tanh, sech, csch, coth
    -   arcsin, arccos, arctan, arcsec, arccsc, arccot
    -   arsinh, arcosh, artanh, arsech, arcsch, arcoth
-   log, abs, floor, ceil, ln

2 arguments: 2 column lists:

-   Transpose
-   Non-consecutive transpose

2 arguments: 1 column list, 1 row list:

-   Reverse non-consecutive transpose

2 arguments: column list(s)+row list(s)+register(s):

-   Add, subtract, multiply, divide, exponent, modulo
-   Comparison:
    -   Less than (LT), less than or equal to (LTE), equal to (EQ/NXOR),
    -   greater than or equal to (GTE), greater than (GT),
    -   not equal to (NE/XOR)
-   Bit shift left and right, bit roll left and right
-   Trig:
    -   2-arg atan (atan2)
-   2-arg log
-   Sign extension
-   Search (Content-addressing)
-   Logical and, or, xor, nand, nor, nxor, imp
-   Bitwise and, or, xor, nand, nor, nxor, imp
-   Min, max

2-3 arguments: 1 list, 1-2 registers (second register for filling):

-   Word shift up and down, word roll up and down

3 arguments: column list(s)+row list(s)+register(s):

-   Conditional value: if a then b else c

Per-row arguments: 1 column list+1 row list per row:

-   Concurrent list element read

Per-row results: Arguments: 2 column lists+1 row list per row:

-   Concurrent list element write

The potential applicability of the invention is very wide. We expect such applications will include:

-   Graphics: vertices, splines, pixels used in video memory, shading, rendering, polygons, spatial transformations, and ray tracing
-   Image processing and manipulation and “filter” effects
-   Media and data compression and codecs
-   Audio data processing and manipulation: array indices and the sine function in constant time for multiple sample points
-   Signal processing and signal encoding and decoding
-   Robotics and neural nets
-   Fourier analysis and transforms
-   Numerical methods and computations
-   Formal languages and finite automata
-   Taylor series and differential equations
-   Trigonometry and integrals
-   Modeling and simulation
-   Statistics and financial
-   Cryptography
-   Matrix multiplication and sparse matrices

As will be recognized by those skilled in the art, the innovative concepts described in the present application can be modified and varied over a tremendous range of applications, and accordingly the scope of patented subject matter is not limited by any of the specific exemplary teachings given. It is intended to embrace all such alternatives, modifications and variations that fall within the spirit and broad scope of the appended claims.

None of the description in the present application should be read as implying that any particular element, step, or function is an essential element which must be included in the claim scope: THE SCOPE OF PATENTED SUBJECT MATTER IS DEFINED ONLY BY THE ALLOWED CLAIMS. Moreover, none of these claims are intended to invoke paragraph six of 35 USC section 112 unless the exact words “means for” are followed by a participle.

The claims as filed are intended to be as comprehensive as possible, and NO subject matter is intentionally relinquished, dedicated, or abandoned. 

What is claimed is:
 1. An electronic memory device, comprising: a plurality of memory cells configured in a grid having a plurality of rows and columns; a plurality of horizontal pathways, each connecting between neighboring individual memory cells in said plurality of rows respectively; and a plurality of vertical pathways, each connecting between neighboring individual memory cells in said plurality of columns respectively, wherein each individual of said plurality of memory cells is configured to function as a storage unit or a computation unit or both, and computation operations are configured to be performed in situ said plurality of memory cells.
 2. The electronic memory device of claim 1, wherein multiple simultaneous computation operations are configured to be performed in situ said plurality of memory cells.
 3. The electronic memory device of claim 1, wherein a row of said memory cells are configured to be computation units, and a primary computation unit is connected to said row via a horizontal pathway.
 4. The electronic memory device of claim 1, wherein a column of said memory cells are configured to be computation units, and a primary computation unit is connected to said column via a vertical pathway.
 5. The electronic memory device of claim 1, wherein said plurality of horizontal or vertical pathways are configured in two logical dimensions along which contents of memory cells are configured to be transferred.
 6. The electronic memory device of claim 5, wherein said horizontal pathways connecting computation units are configured to transfer data.
 7. The electronic memory device of claim 1, wherein said computation operations include Unary operations, Binary operations in 9 configurations given by the square of the set {column list, row list, register}, Ternary operations in the 27 configurations given by the cube of the set {column list, row list, register}, Address copy, Multiple result columns, Offset read, Offset write, Addressable content operations with equality criteria, Addressable content operations with inequality criteria, Producing the indices of (Addressable content operations with equality criteria) and (Addressable content operations with inequality criteria), Producing the values that are matched in (Addressable content operations with equality criteria) and (Addressable content operations with inequality criteria), Producing Boolean values indicating the outcome of the test in (Addressable content operations with equality criteria) and (Addressable content operations with inequality criteria), Counting the results of (Addressable content operations with equality criteria) and (Addressable content operations with inequality criteria), Access to the results of (Addressable content operations with equality criteria) and (Addressable content operations with inequality criteria) by indices counted among the results only, Shifting or collating the results of (Addressable content operations with equality criteria) and (Addressable content operations with inequality criteria) to remove entries for failed outcomes, Direct folding, Deferred folding, Transpose, Reverse transpose, Offset transpose, Reverse offset transpose, Word shift, Concurrent bubble sort, Concurrent insertion sort, Concurrent merge sort, and/or Cartesian sort, or the combination thereof.
 8. The electronic memory device of claim 1, wherein a plurality of neighboring columns of said memory cells are configured to function as computation units, and a plurality of neighboring rows of said memory cells are configured to function as computation units, and a single computation operation is dividedly configured to be conducted simultaneously between said neighboring columns of said memory cells or to be conducted simultaneously between said neighboring rows of said memory cells.
 9. The electronic memory device of claim 8, wherein said single computation operation is related to Unary operations, Binary operations in 9 configurations given by the square of the set {column list, row list, register}, Ternary operations in the 27 configurations given by the cube of the set {column list, row list, register}, Address copy, Multiple result columns, Offset read, Offset write, Addressable content operations with equality criteria, Addressable content operations with inequality criteria, Producing the indices of (Addressable content operations with equality criteria) and (Addressable content operations with inequality criteria), Producing the values that are matched in (Addressable content operations with equality criteria) and (Addressable content operations with inequality criteria), Producing Boolean values indicating the outcome of the test in (Addressable content operations with equality criteria) and (Addressable content operations with inequality criteria), Counting the results of (Addressable content operations with equality criteria) and (Addressable content operations with inequality criteria), Access to the results of (Addressable content operations with equality criteria) and (Addressable content operations with inequality criteria) by indices counted among the results only, Shifting or collating the results of (Addressable content operations with equality criteria) and (Addressable content operations with inequality criteria) to remove entries for failed outcomes, Direct folding, Deferred folding, Transpose, Reverse transpose, Offset transpose, Reverse offset transpose, Word shift, Concurrent bubble sort, Concurrent insertion sort, Concurrent merge sort, and/or Cartesian sort, or the combination thereof.
 10. The electronic memory device of claim 1, further comprising: an independent instruction counter configured to be located in one of said memory cells or a row of said memory cells or a column of said memory cells, wherein said independent instruction counter contains addresses of a computation operation.
 11. The electronic memory device of claim 1, further comprising: a set of unique keys configured to be located in one of said memory cells or a row of said memory cells or a column of said memory cells, wherein said set of unique keys link to a set of associative containers.
 12. A method for conducting computation operations in memory bank, comprising the steps of: constructing a memory bank having a plurality of memory cells configured in a grid having a plurality of rows and columns; constructing a plurality of horizontal pathways on said memory bank, each connecting between neighboring individual memory cells in said plurality of rows respectively; and constructing a plurality of vertical pathways on said memory bank, each connecting between neighboring individual memory cells in said plurality of columns respectively, wherein each individual of said plurality of memory cells is configured to function as a storage unit or a computation unit or both, and computation operations are configured to be performed in situ said plurality of memory cells.
 13. The method for conducting computation operations in memory bank of claim 11, wherein multiple simultaneous computation operations are configured to be performed in situ said plurality of memory cells.
 14. The method for conducting computation operations in memory bank of claim 11, wherein a row of said memory cells are configured to be computation units, and a primary computation unit is connected to said row via a horizontal pathway.
 15. The method for conducting computation operations in memory bank of claim 11, wherein a column of said memory cells are configured to be computation units, and a primary computation unit is connected to said column via a vertical pathway.
 16. The method for conducting computation operations in memory bank of claim 11, wherein said plurality of horizontal or vertical pathways are configured in two logical dimensions along which contents of memory cells are configured to be transferred.
 17. The method for conducting computation operations in memory bank of claim 15, wherein said horizontal pathways connecting computation units are configured to transfer data.
 18. The method for conducting computation operations in memory bank of claim 11, wherein said computation operations include Unary operations, Binary operations in 9 configurations given by the square of the set {column list, row list, register}, Ternary operations in the 27 configurations given by the cube of the set {column list, row list, register}, Address copy, Multiple result columns, Offset read, Offset write, Addressable content operations with equality criteria, Addressable content operations with inequality criteria, Producing the indices of (Addressable content operations with equality criteria) and (Addressable content operations with inequality criteria), Producing the values that are matched in (Addressable content operations with equality criteria) and (Addressable content operations with inequality criteria), Producing Boolean values indicating the outcome of the test in (Addressable content operations with equality criteria) and (Addressable content operations with inequality criteria), Counting the results of (Addressable content operations with equality criteria) and (Addressable content operations with inequality criteria), Access to the results of (Addressable content operations with equality criteria) and (Addressable content operations with inequality criteria) by indices counted among the results only, Shifting or collating the results of (Addressable content operations with equality criteria) and (Addressable content operations with inequality criteria) to remove entries for failed outcomes, Direct folding, Deferred folding, Transpose, Reverse transpose, Offset transpose, Reverse offset transpose, Word shift, Concurrent bubble sort, Concurrent insertion sort, Concurrent merge sort, and/or Cartesian sort, or the combination thereof.
 19. The method for conducting computation operations in memory bank of claim 11, wherein a plurality of neighboring columns of said memory cells are configured to function as computation units, and a plurality of neighboring rows of said memory cells are configured to function as computation units, and a single computation operation is dividedly configured to be conducted simultaneously between said neighboring columns of said memory cells or to be conducted simultaneously between said neighboring rows of said memory cells.
 20. The method for conducting computation operations in memory bank of claim 19, wherein said single computation operation is related to Unary operations, Binary operations in 9 configurations given by the square of the set {column list, row list, register}, Ternary operations in the 27 configurations given by the cube of the set {column list, row list, register}, Address copy, Multiple result columns, Offset read, Offset write, Addressable content operations with equality criteria, Addressable content operations with inequality criteria, Producing the indices of (Addressable content operations with equality criteria) and (Addressable content operations with inequality criteria), Producing the values that are matched in (Addressable content operations with equality criteria) and (Addressable content operations with inequality criteria), Producing Boolean values indicating the outcome of the test in (Addressable content operations with equality criteria) and (Addressable content operations with inequality criteria), Counting the results of (Addressable content operations with equality criteria) and (Addressable content operations with inequality criteria), Access to the results of (Addressable content operations with equality criteria) and (Addressable content operations with inequality criteria) by indices counted among the results only, Shifting or collating the results of (Addressable content operations with equality criteria) and (Addressable content operations with inequality criteria) to remove entries for failed outcomes, Direct folding, Deferred folding, Transpose, Reverse transpose, Offset transpose, Reverse offset transpose, Word shift, Concurrent bubble sort, Concurrent insertion sort, Concurrent merge sort, and/or Cartesian sort, or the combination thereof.
 21. The method for conducting computation operations in memory bank of claim 11, further comprising: an independent instruction counter configured to be located in one of said memory cells or a row of said memory cells or a column of said memory cells, wherein said independent instruction counter contains instructions for a computation operation.
 22. The method for conducting computation operations in memory bank of claim 11, further comprising: a set of unique keys configured to be located in one of said memory cells or a row of said memory cells or a column of said memory cells, wherein said set of unique keys link to a set of associative containers. 